[57] ABSTRACT

A computer controlled display system providing for graphi- 
cal representation of a query to a database and creation and 
traversal through a search history. A database search is 
typically performed by a sequence of narrowing queries. 
Each narrowing query is performed in a query window. A 
query window is comprised of an input area for entering 
query expressions, a query results display area, an indicator
of a search scope associated with the query window and a
history indicator area. A suitable information visualization
technique is used to graphically display the search results in 
the query results display area. From these visualizations, 
new search scopes and query windows are created. The query
windows for the current search path are displayed at any
instant of time during the search. A
history mechanism provides for ready traversal through the 
search history. 

18 Claims, 8 Drawing Sheets

METHOD AND APPARATUS FOR
CONCURRENT GRAPHICAL 
VISUALIZATION OF A DATABASE SEARCH 
AND ITS SEARCH HISTORY 

FIELD OF THE INVENTION 

The present invention relates generally to the field of
information visualization, and more particularly to graphical
visualization of a database search.

BACKGROUND OF THE INVENTION 

More and more information is being made available to 
computer system users via various media such as CD-ROMs,
on-line databases and the like (collectively referred
to as databases). A query to a database typically requires a 
complex textual specification based on keywords and logical 
relationships between sets of information. In most instances, 
the query returns only the results. Often, the results are not
useful either because the results are much larger than that
which can be easily visualized and manipulated, or because 
the result is unexpectedly empty. 

When performing a search, it is typical that a search
strategy will be used in order to find the desired information.
Most search strategies are premised on attaining a reasonable
number of items that satisfy the search criteria. Typically,
a query is comprised of keywords (i.e. search terms) connected
together via logical and/or proximity operators.
Logical operators are used to include or exclude items in a
set, whereas proximity operators are used to identify items
having keywords that are a predetermined distance apart
(such as within 10 words or adjacent). Once a query
is made and executed, a list of items satisfying the criteria
of the query is presented to the user. The user can then either
view one or more items in the list or, if the list is large,
modify the search to reduce the number of items in the list.
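
As a rough illustration of the two operator families just described, the following minimal Python sketch treats logical operators as set operations over matching item identifiers and a proximity operator as a word-distance test. The function names (`items_with`, `within`), the toy `items` collection and the whitespace tokenization are assumptions made for this example only; they do not reproduce the query syntax of any particular retrieval system.

```python
def words(text):
    """Split text into lowercase words (naive whitespace tokenization)."""
    return text.lower().split()


def items_with(items, keyword):
    """Return the ids of all items whose text contains the keyword."""
    return {item_id for item_id, text in items.items() if keyword in words(text)}


def within(text, first, second, max_distance):
    """Proximity test: do the two keywords occur at most max_distance words apart?"""
    tokens = words(text)
    first_pos = [i for i, w in enumerate(tokens) if w == first]
    second_pos = [i for i, w in enumerate(tokens) if w == second]
    return any(abs(i - j) <= max_distance for i in first_pos for j in second_pos)


# Hypothetical three-item collection, used only for illustration.
items = {
    1: "graphical display of a database query",
    2: "query history shown as a tree",
    3: "database of legal codes",
}

# Logical operators as set operations on the matching item ids.
both = items_with(items, "database") & items_with(items, "query")      # AND
either = items_with(items, "database") | items_with(items, "query")    # OR
excluded = items_with(items, "database") - items_with(items, "query")  # AND NOT

# Proximity operator: the two keywords must be adjacent (distance of one word).
adjacent = {i for i, text in items.items() if within(text, "database", "query", 1)}
```

In this sketch a narrowing step would simply intersect a new result set with the previous one, which is the kind of bookkeeping the user of a purely textual system has to track by hand.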

One prior art system, the LEXIS information retrieval
system, allows queries to be performed according to various
levels. Each subsequent level contains a subset of the results
of the immediately prior level, based on user provided
search criteria. The LEXIS system provides text based
feedback which indicates the number of items found which
satisfy the search criteria. The user then has various options
to view the list of items found (e.g. full text, keyword in
context, segments or as a list of citations).

A second prior art system is the DIALOG information 
retrieval system. In DIALOG, query results can be structured
so that feedback is provided as to the number of items
found which satisfy each keyword. Queries may also be
combined to create new queries. However, the user must
track the queries made in order to make effective use of these 
facilities. 

When performing searches, it may also be desirable to be
able to restart searches at a point in the middle of a search
path. In the aforementioned LEXIS system, this is accomplished
by specifying and modifying a prior search level.
This has the drawback that it entirely replaces the prior
search level and all search levels below the level modified. In
the DIALOG system this can be done, but it is left to the user to
map out the query history according to the search sequence
taken. No mechanism is provided to the user to accommodate
this. Thus, it would be desirable to have a system
that is capable of creating a search history through which a
user may restart searches at designated points without
destroying the results of any prior searching.

Further materials relevant to the present invention include:

EP 0 535 986 A2, entitled "Method of Operating A
Processor", Robertson, which is assigned to the assignee of
the present invention, describes a method for centering a
selected node of a node link structure along a centering line.
The nodes are in rows, and each row extends across a
centering line with links between nodes in adjacent rows.
When a user requests a centering operation for an indicated
node, a sequence of images is presented, each including a 
row that appears to be a continuation of the row with the 
indicated node and that includes a continued indicated node 
that appears to be a continuation of the indicated node. The 
rows appear to be shifted, bringing the continued indicated 
nodes toward the centering line, until a final shift locks the 
continued indicated node into position at the centering line.
The positions of the indicated node and a subset of the 
continued indicated nodes together can define an asymptotic 
path that begins at the position of the indicated node and 
approaches the center line asymptotically until the final shift 
occurs. The displacements between positions can follow a 
logarithmic function, with each displacement being a pro- 
portion of the distance from the preceding position to the 
centering line. Each node can be rectangular, and the nodes 
in each row can be separated by equal offsets to provide 
compact rows. Each node can be a selectable unit, so that the 
user can request a centering operation by selecting a node, 
such as with a mouse click.
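
The asymptotic approach described for this reference, in which each displacement is a proportion of the remaining distance to the centering line, can be sketched as follows. This is only an illustration of that one sentence: the function name, the `proportion` value and the number of intermediate images are assumptions, and the sketch is not taken from the referenced application.

```python
def centering_positions(start, center, proportion=0.5, steps=3):
    """Positions of a node as it is shifted toward the centering line.

    Each intermediate shift covers a fixed proportion of the distance that
    remains between the preceding position and the centering line, so the
    positions approach the line asymptotically. A final shift then locks
    the node onto the line itself.
    """
    positions = [start]
    for _ in range(steps):
        previous = positions[-1]
        positions.append(previous + proportion * (center - previous))
    positions.append(center)  # final shift: lock into position on the line
    return positions


# A node 8 units from a centering line at 0, halving the distance each image.
path = centering_positions(8.0, 0.0, proportion=0.5, steps=3)
```

Without the final locking step the node would never quite reach the line, which is why the description distinguishes the asymptotic path from the last shift.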

EP 0 447 095 A, Robertson, et al., entitled "Workspace
Display", which is assigned to the assignee of the present
invention, discloses a processor which presents a sequence of
images of a workspace that is stretched to enable the user to 
view a part of a workspace in greater detail. The workspace 
includes a middle section and two peripheral sections that 
meet the middle section on opposite edges. Each of the 
sections appears to be a rectangular two-dimensional surface 
and they are perceptible in three dimensions. When the user 
is viewing the middle section as if it were parallel to the 
display screen surface, each peripheral section appears to 
extend away from the user at an angle from the edge of the 
middle section so that the peripheral sections occupy rela- 
tively little of the screen. When the user requests stretching, 
the middle section is stretched and the peripheral sections 
are compressed to accommodate the stretching. When the 
user requests destretching, the middle section is destretched
and the peripheral sections are decompressed accordingly. 

Furnas, G. W., "Generalized Fisheye Views," CHI '86
Proceedings, ACM, April 1986, pp. 16-23, describes fisheye
views that provide a balance of local detail and global
context. Section 1 discusses fisheye lenses that show places
nearby in great detail while showing the whole world,
showing remote regions in successively less detail; a caricature
is the poster of the "New Yorker's View of the United
States." Section 3 describes a degree of interest (DOI) function
that assigns to each point in a structure a number telling
how interested the user is in seeing that point, given the
current task. A display can then be made by showing the
most interesting points, as indicated by the DOI function.
The fisheye view can achieve, for example, a logarithmically
compressed display of a tree, as illustrated by FIG. 4 of
Furnas for a tree structured text file. Section 4 also describes
fisheye views for botanical taxonomies, legal codes, text
outlines, a decision tree, a telephone area code directory, a
corporate directory, and UNIX file hierarchy listings. Section 5
indicates that a display-relevant notion of a priori
importance can be defined for lists, trees, acyclic directed
graphs, general graphs, and Euclidean spaces; unlike the
geographical example which inspired the metaphor of the
"New Yorker's View," the underlying structures need not be
spatial, nor need the output be graphic. FIG. 6 of Furnas
shows a fisheye calendar.

Spoerri, Anselm, "InfoCrystal: A visual tool for information
retrieval", MIT-CEIT-TR 93-3, describes, with reference
to FIG. 1, how to transform a Venn diagram into an iconic
display which represents all possible Boolean queries
involving its inputs in a normal form. The Venn diagram is
first exploded into its disjoint subsets. The subsets are then
represented by icons whose shapes reflect the number of
criteria satisfied by their contents (also called the rank of a
subset). Finally, the subset icons are surrounded by a border
area that contains criterion icons that represent the original
sets. Visual coding principles that are incorporated include
(1) shape coding to indicate the number of criteria that the
contents associated with an interior icon satisfy, (2) proximity
coding to indicate that the closer an interior icon is
located to a criterion icon, the more likely it is that the icon's
contents are related to it, (3) rank coding to indicate how
many criteria are satisfied, (4) color or texture coding to
indicate which particular criteria are satisfied by the icon's
contents, (5) orientation coding so that the sides of an icon
are positioned so that they face the criteria they satisfy,
and (6) size or brightness and saturation coding to indicate the
number of elements represented by an icon. Section IX
describes a Visual Query language wherein the output of an
InfoCrystal is defined as a set of selected interior icons. FIG.
3 illustrates how the InfoCrystals can be "chained together"
to form a hierarchical query structure.

SUMMARY OF THE INVENTION

A computer controlled display system providing for
graphical representation of a query to a database and creation
and traversal through a search history is disclosed. In
the present invention, the results of a query to a database are
graphically displayed in a query window using a suitable
information visualization technique. The information visualization
causes the display of the query results as one or
more disjoint and selectable graphical regions relative to a
search scope (e.g. a Venn diagram situated on a plane). The
query window is further comprised of an input area for
entering query expressions, an indicator of a search scope
associated with the query window and a history indicator
area. The history indicator area contains icons identifying
siblings within a search level. The query windows in a
particular search path are displayed as concentrically nested
to provide a visual cue as to the relationship of the query
windows. Where a query window is in the nesting indicates
its level in the search history. The nesting further provides
for easy traversal through that search path, which can be
accomplished in a point and click fashion. New query windows are
created by definition of a new search scope based on search
results.

The present invention further provides a search history
mechanism for facilitating traversal through the search history.
One aspect of the search history mechanism is embodied
in the history indicator areas in each of the query
windows. The alignment of the query windows and their
corresponding indicators reveals a branch of the search
history. Traversal to particular points in the path is enabled
by clicking on the icon associated with the desired query
window (or search scope). A second aspect of the search
history mechanism is the provision of a history window for
displaying the search history in a tree format.

DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of a computer based system 
upon which the currently preferred embodiment of the 
present invention may be implemented. 

FIG. 2 is a flowchart illustrating the basic steps in a
database search which result in the creation of query windows
and a search history tree, as may be performed in the
currently preferred embodiment of the present invention.

FIG. 3 is an example of a search history tree as may be
created in the present invention.

FIG. 4 illustrates a query window as may be utilized by
the currently preferred embodiment of the present invention.

FIG. 5a illustrates a first configuration of nested query
windows illustrating a first search path of the search history
tree of FIG. 3, as may be utilized by the currently preferred
embodiment of the present invention.

FIG. 5b illustrates a second configuration of nested query
windows illustrating a second search path of the search
history tree of FIG. 3, as may be utilized by the currently
preferred embodiment of the present invention.

FIG. 5c illustrates a third configuration of nested query
windows illustrating a third search path and said first search
path of the search history of FIG. 3 displayed in the same
query window, as may be utilized by the currently preferred
embodiment of the present invention.

FIG. 6 illustrates a history window displaying the search
history tree of FIG. 3, as may be utilized by the currently
preferred embodiment of the present invention.

FIG. 7 illustrates a Venn diagram visualization for a query
window as may be used by the currently preferred embodiment
of the present invention.

FIG. 8 illustrates the query window of FIG. 7 after an
executed query showing a graphical visualization of the
query results using a Venn diagram.

FIG. 9 illustrates an update of the query window of FIG.
7 after an executed query showing a graphical visualization
of the query results using a Venn diagram where one
expression of the query has no elements in common with the
other expressions.

FIG. 10 illustrates a perspective wall implementation of a
query window in the currently preferred embodiment of the
present invention.

FIG. 11 illustrates the query window of FIG. 10 after an
executed query showing a graphical visualization of the
query results using a perspective wall.

DETAILED DESCRIPTION OF THE
PREFERRED EMBODIMENTS 

A computer controlled display system for graphically 
displaying the results of a query to a database is disclosed. 
In the following description numerous specific details are set 
forth, such as the operational aspects of a database, in order 
to provide a thorough understanding of the present invention.
It would be apparent, however, to one skilled in the art
to practice the invention without such specific details. In
other instances, specific implementation details, such as
software coding techniques for creating graphical objects, 
have not been shown in detail in order not to unnecessarily 
obscure the present invention. 

The term database as used herein refers to any body of
information that is accessible via a computer based system.
The body of information would typically be a database
located on a storage medium directly connected to the
computer based system (e.g. on a CD-ROM) or accessible
via a network (e.g. an on-line information source). Alternatively,
the body of information could be a collection of
documents or document parts managed by a document
management system. In any event, such databases can be
characterized as having three primary parts: the information
or data itself, a retrieval/updating part and a user interface
part. The retrieval/updating part enables access to the information
for retrieval, editing, or addition of information. The
user interface part is the mechanism by which a user
interacts with the database to search for and obtain information.
It is this user interface part to which the present
invention is directed.

As used herein, the term search refers to the steps performed
for retrieval of information from a database. The
term document refers to the specific items of information
contained in the database. Documents include textual, audio
or visual works. The term search scope refers to the set of
information in the database which may be retrieved at any
instant during the search. As will become apparent in the
following description, a search scope is created through
selection of subsets of the results of a query. The search
scope is narrowed as the various queries in the search are
performed. The term query refers to a set of parameters
provided for executing a step in a search. This set of
parameters is typically in the form of one or more expressions.
Expressions are predicates such as keywords or sets of
keywords, dates, numbers and other data types, as well as
various combinations thereof, combined with logical or proximity
operators which define the documents of interest.

Overview of a Computer Based System in the
Currently Preferred Embodiment of the Present
Invention

The computer based system on which the currently preferred
embodiment of the present invention may be implemented
is described with reference to FIG. 1. Referring to
FIG. 1, the computer based system is comprised of a
plurality of components coupled via a bus 101. The bus 101
illustrated here is simplified in order not to obscure the
present invention. The bus 101 may consist of a plurality of
parallel buses (e.g. address, data and status buses) as well as
a hierarchy of buses (e.g. a processor bus, a local bus and an
I/O bus). In any event, the computer system is further
comprised of a processor 102 for executing instructions
provided via bus 101 from either static Read Only Memory
(ROM) 103 or dynamic Random Access Memory (RAM)
104. The processor 102, ROM 103 and RAM 104 may be
discrete components or a single integrated device such as an
Application Specific Integrated Circuit (ASIC).

It should further be noted that the processor 102 is used
to execute instructions coded in a suitable programming
language for creating the graphical data used to create the
query and history windows and the various visualizations
described herein. The processor 102 would also process the
queries associated with a database search. Moreover, such
instructions would be used for causing the steps outlined in
FIG. 2 to be performed.

Also coupled to the bus 101 are a keyboard 105 for
entering alphanumeric input, a cursor control device 106 for
manipulating a cursor, a display 107 for displaying visual
output, and a fixed disk 108 for storing data (e.g. the
database). The fixed disk 108 may be a magnetic or optical
disk. Further coupled to the bus 101 are a removable disk 109
and a network connection 110. The removable disk 109 may
be a floppy disk drive, or an optical disk drive such as a
CD-ROM drive (which itself may be a database). The network
connection 110 represents either a local area network
coupling or a public network coupling. In any event, the
network connection 110 provides for coupling to databases
residing on the network.

The representation of the results of a query will be 
displayed on display 107 and the user will interact with the 
computer based system through a combination of
the keyboard 105 and the cursor control device 106. 

While the preferred embodiment of the present invention
is embodied on a computer based system, the present
invention could be practiced on any computer controlled
display system, such as a fixed function terminal. The
currently preferred embodiment of the present invention is
implemented on a computer controlled display system having
a Graphical User Interface (GUI) which allows multiple
concurrent "windows" to be in operation (e.g. one of the
family of Macintosh computers available from Apple Computer,
Inc. of Cupertino, Calif.). A "window" refers to a
visual representation of an executing task. Windows and
the operation thereof are well known in the art, so no further
discussion of windows or their operation is deemed necessary.
Such a GUI will also support operations such as "point
and click". A "point and click" operation is one where a
cursor on a display screen is positioned over a desired
portion of the display, such as an icon, using a cursor control
device such as a mouse or trackball. Once the cursor is
appropriately positioned, a button/switch associated with the
cursor control device is quickly depressed and released. This
creates an electrical signal which causes a predetermined
operation to occur. Other operations may require a "double
click" where the button/switch is depressed and released
rapidly, twice in succession.

Operational Flow and Creation of a Search History 

FIG. 2 is a flowchart illustrating the basic steps of 
performing a search and the resulting creation of the history 
tree structure in the currently preferred embodiment. 
Accordingly, certain steps, such as selection of the desired 
visualization for the search results, are not described. Refer- 
ring to FIG. 2, a query window is created having the entire 
database as a search scope, step 201. Referring briefly to 
FIG. 3, this is represented as root node 320. Referring back 
to FIG. 2, a search is then performed based on one or more 
search expressions, step 202. The search expressions pro- 
vided would be according to the search rules of the particular 
data retrieval mechanism associated with the database. For
example, if the database was based on a relational database, 
the search expressions would be based on relationships 
between elements of known categories of data. The query 
will be processed, the desired visualization determined and 
a graphical visualization of the search results is displayed in 
the query window, step 203. As the present invention may 
incorporate various visualization techniques, the user will 
have selected the desired visualization either before or after 
execution of the query. Alternatively, the visualization may 
be automatically selected based on the query. In any event, 
a user must then determine if they are satisfied with the
search results (i.e. the results are o.k.), step 204. If they are
satisfied, the search to this point is complete and the user
would select documents which would be retrieved for viewing,
step 205. If they are not satisfied, the user must then
formulate a subsequent search strategy. The options are to
formulate a new query with the same search scope, step 206,
create a new search scope, step 207, or to traverse back to a
prior search scope, step 210. For the option of formulating
a new query to the same search scope, the query is obtained
and the search performed per step 202. For the option of 
defining a new search scope, the user selects the new search 
scope from the results of the last executed query. As will be 
explained in greater detail below, this will typically be done 
by a point and click operation on the visualization of the 
search results within the query window. The subsets of the 
search results are determined, step 208 and a new query 
window is created inside the previous query window, step 
209. This creation within a prior query window causes a
"concentric nesting" of the query windows to provide a
visual cue as to the relationship of the windows. Once the
new search scope is created, queries are executed per step
202. Referring briefly to FIG. 3, this step is performed for
creation of all the nodes except the root node 320.

If the user decides to traverse back to a previously defined
search scope per step 210, the user must select the desired
search scope. As will become apparent in the description
below, the present invention provides two ways to accomplish
this: via the history indicators in the concurrently
displayed query windows, or via the history
structure displayed in the history window. In any event, once
the desired search scope is selected, the display is updated
to present the query windows corresponding to the path of
the selected search scope, step 211. New queries would then
be executed, per step 202, based on the selected search
scope.
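
The flow of FIG. 2 as described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the disclosed implementation: a search scope is modelled as a set of document ids, a query as a list of keywords, and the names `Scope`, `run_query` and `narrow`, as well as the toy `database`, are hypothetical. Step numbers in the comments refer to the flowchart steps named in the text.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)
class Scope:
    """One search scope (a node of the search history), modelled as document ids."""
    docs: frozenset
    parent: Optional["Scope"] = None
    children: list = field(default_factory=list)


def run_query(scope, database, keywords):
    """Step 202: evaluate each keyword against the documents in the scope only."""
    return {
        kw: frozenset(d for d in scope.docs if kw in database[d].lower().split())
        for kw in keywords
    }


def narrow(scope, subset):
    """Steps 207-209: a selected subset of the results becomes a new child scope."""
    subset = frozenset(subset)
    if not subset <= scope.docs:
        raise ValueError("a new search scope must be a subset of its parent scope")
    child = Scope(docs=subset, parent=scope)
    scope.children.append(child)
    return child


# Hypothetical four-document database used only for illustration.
database = {
    1: "graphical query window",
    2: "query history tree",
    3: "perspective wall",
    4: "venn diagram query",
}

root = Scope(docs=frozenset(database))                     # step 201
results = run_query(root, database, ["query", "history"])  # steps 202-203
first = narrow(root, results["query"])                     # steps 207-209
inner = run_query(first, database, ["tree"])               # step 202 on the new scope
current = first.parent                                     # step 210: traverse back
second = narrow(current, results["history"])               # a sibling scope is added
```

Traversing back only changes which scope is current; the earlier child scope remains attached to its parent, so restarting from a prior point does not destroy prior results.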

Referring back to FIG. 3, it is presumed that the steps 
described in FIG. 2 are used to add the nodes 321-326. It 
should be noted that the nodes 321 and 322 are on the same
level and have the same parent node. The nodes 323-326 are 
on the same level but the nodes 323-325 have a different 
parent than node 326. Thus, nodes 323-325 are said to be on 
different search paths than node 326. As will be described 
below, when displaying a search path, the query window for 
a node for each level in the path is displayed. 
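
The relationships among the nodes of FIG. 3 can be made concrete with a small tree sketch. The class name `ScopeNode` and its methods are hypothetical. The text above states only which nodes share a parent; the specific assignment used here (nodes 323-325 under node 321, node 326 under node 322) is read from the description of FIGS. 5a and 5b and should be treated as an assumption of this sketch.

```python
class ScopeNode:
    """A node of the search history tree; each node stands for one search scope."""

    def __init__(self, label, parent=None):
        self.label = label
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def level(self):
        """Depth of the node in the tree; the root is at level 0."""
        return 0 if self.parent is None else self.parent.level() + 1

    def search_path(self):
        """Labels of the nodes from the root down to this node."""
        path = []
        node = self
        while node is not None:
            path.append(node.label)
            node = node.parent
        return path[::-1]


n320 = ScopeNode(320)        # root: the entire database
n321 = ScopeNode(321, n320)
n322 = ScopeNode(322, n320)
n323 = ScopeNode(323, n321)
n324 = ScopeNode(324, n321)
n325 = ScopeNode(325, n321)
n326 = ScopeNode(326, n322)
```

Nodes 323 and 326 are on the same level but their search paths diverge below the root, which is the sense in which they lie on different search paths.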

Query Windows 

The structure of a query window in the currently preferred 
embodiment is illustrated with respect to FIG. 4. The term 
query window is meant to refer to the elements of the visual 
interface and does not limit the spirit and scope of the 
present invention. Examples of different implementations of 
a query window are provided in the description of the visual 
representation of query results provided below. Referring to 
FIG. 4, a query window 401 has a query input area 402, a 
results display area 403 and a history indicator area 404. The 
query input area 402 is where a user inputs a query. As 
described above, a query is comprised of one or more 
expressions. The composition of the query input area will 
depend on the type of visualization desired.

The results display area 403 will display a graphical 
visualization of the search results according to a selected 
information visualization technique. Visualizations such as
Venn diagrams, a perspective wall, a hierarchical representation,
lists, or tables may be utilized. The key criterion for
a graphical visualization is that the results are presented as
selectable disjoint subsets. The manner in which selection 
occurs will depend on the graphical visualization used. 
Examples of such visualizations and a corresponding selec- 
tion technique are described in greater detail below. 

Also present in the results display area 403 is a search
scope description 405 for the query window. The search
scope description 405 is typically a textual description of the
search scope of the query window (e.g. a logical organization
of the expressions used to achieve the search scope).

The history indicator area 404 provides a means by which
a user may visually determine a location within a hierarchical
search path. The history indicator area will contain a number
of indicators representing different created search paths at
the same search level. The indicators used may be an icon or
text symbol(s). The indicator(s) representing the search
scope being displayed (and the search path) are highlighted.
Here, an icon 406 (a box) is displayed to indicate that there
is one search scope defined at this level. Multiple icons may
be present. Each icon in the history indicator area 404
corresponds to a search scope.

Variations of the placement of the described areas are
within the scope of the present invention. For example, the
history indicator area may run down a vertical side of the
query window. This may be desirable if the history structure
used in a history window was displayed with a horizontal
orientation (rather than the vertical orientation of FIG. 3). In
such a case, the levels of the tree would naturally be in
columns rather than rows. As a result, it would be more
consistent with the history structure to have the icons
displayed with a vertical orientation. Moreover, each of the
various areas may be implemented as separate "windows".
This would enable flexibility in the display of the query
window. For example, it may be desirable at some point to
remove a query input area when further query input is not
needed.

As noted above, during the course of a database search,
query windows corresponding to a direct search path are
displayed in a nested concentric fashion. This is illustrated
with reference to FIGS. 5a and 5b. FIGS. 5a and 5b illustrate
two search paths taken within the history structure of FIG.
3. Referring to FIG. 5a, a plurality of query windows are
displayed. There is a single query window displayed for
each level in the search path. Only the query windows in the
direct search path are displayed. However, the history indicator
area is used to indicate the number of "sibling" nodes
along the search path that are on the same level. In FIG. 5a three
query windows are displayed. Query window 501 corresponds
to the search scope of node 323, query window 502
corresponds to the search scope of node 321 and query
window 503 corresponds to the search scope of node 320.
For query window 501 there are three nodes in the search
path at that level. This is indicated by the three boxes
(501b-501d) in the history indicator area 501a of query
window 501. Note that the box 501b is highlighted to
indicate the direct search path at this level of the tree.
Moreover at this point it indicates where this query window
is in the history structure. Similarly with respect to query
window 502, history indicator area 502a is comprised of
boxes 502b-502c. The box 502b is highlighted to indicate
the direct path for the search scope of query window 501.

Referring now to FIG. 5b, again three windows 510-512
are displayed and correspond to the three levels of the tree
structure. Query window 510 corresponds to the search
scope of node 326, query window 511 corresponds to the
search scope of node 322 and query window 512 corresponds
to the search scope of node 320. For query window
510 there is only one node at that level of the search path so
there is only one box 510b in history indicator area 510a. For
query window 511, there are two nodes at that level of the
search path, indicated by boxes 511b-511c of history area
511a. In this case, the box 511c is in the search path, so it
is highlighted. Finally, the box 512b of the history area 512a
is highlighted since it is in the search path.
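
The nested windows and history indicator boxes of FIGS. 5a and 5b can be derived mechanically from the history tree. The following Python sketch assumes the tree is stored as a child-to-parent map; the names `PARENT`, `indicator_boxes` and `nested_windows` are hypothetical, and siblings are assumed to be shown in the order in which they were created.

```python
# Child -> parent map for the history tree of FIG. 3 (None marks the root).
PARENT = {320: None, 321: 320, 322: 320, 323: 321, 324: 321, 325: 321, 326: 322}


def indicator_boxes(node):
    """One box per sibling of the node; True marks the highlighted box."""
    siblings = [n for n, p in PARENT.items() if p == PARENT[node]]
    return [(n, n == node) for n in siblings]


def nested_windows(node):
    """Query windows for the direct search path, outermost (root) window first."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return [(n, indicator_boxes(n)) for n in reversed(path)]


fig_5a = nested_windows(323)  # path 320 -> 321 -> 323
fig_5b = nested_windows(326)  # path 320 -> 322 -> 326
```

For node 323 the innermost window carries three boxes with the first highlighted, and for node 322 the second of two boxes is highlighted, matching the boxes described for windows 501 and 511.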

FIG. 5c illustrates windows 520-522 which correspond to
the three levels of the tree structure. Query window 520
corresponds to the search scope of nodes 323 and 324, query
window 521 corresponds to the search scope of node 321
and query window 522 corresponds to the search scope of
node 320. Here, two search scopes at the same level are
concurrently displayed. This is indicated by the highlighting
of the boxes 520a and 520b. The results display area of
query window 520 is divided into areas 523a and 523b, each
corresponding to a search scope. Such a display may be used
for executing a query against both search scopes simultaneously.

History Window 

The present invention combines the graphic visual rep- 
resentation of the results of a query with a means for 
maintaining and traversing a search history. The search
history is used in the event that subsequent queries do not
provide the desired results and it is desirable to restart at a
convenient starting point. The search history in the currently
preferred embodiment is generated in a hierarchical tree 
structure wherein each node represents a search scope. In 
this structure, a child node will represent a subset of the 
scope of the parent node. A user determines when a new 
search scope and resulting node is created. The entire search 
history is presented in a history window. 
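
The hierarchical history described above can be sketched as a small tree type. This is only an illustrative sketch, not the patent's implementation: the names `ScopeNode`, `narrow` and `path_from_root` are hypothetical, and a search scope is assumed to be a plain set of document identifiers.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass(eq=False)
class ScopeNode:
    """One node of the search history: a search scope (a set of document ids)."""
    name: str
    scope: Set[str]
    parent: Optional["ScopeNode"] = None
    children: List["ScopeNode"] = field(default_factory=list)

    def narrow(self, name: str, selected: Set[str]) -> "ScopeNode":
        """Create a child node whose scope is a subset of this node's scope."""
        if not selected <= self.scope:
            raise ValueError("a child scope must be a subset of its parent scope")
        child = ScopeNode(name, set(selected), parent=self)
        self.children.append(child)
        return child

    def path_from_root(self) -> List["ScopeNode"]:
        """Return the search path: the nodes from the root down to this node."""
        node, path = self, []
        while node is not None:
            path.append(node)
            node = node.parent
        return list(reversed(path))


# Hypothetical example: a root scope narrowed twice by the user.
root = ScopeNode("all", {"d1", "d2", "d3", "d4"})
first = root.narrow("first", {"d1", "d2"})
second = first.narrow("second", {"d1"})
print([node.name for node in second.path_from_root()])
```

A new node is created only when `narrow` is called, which mirrors the statement that the user determines when a new search scope and resulting node is created.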

The history window is independent of the query windows. 
A history window is illustrated in FIG. 6. Referring to FIG. 
6, the tree structure of FIG. 3 is shown on history window 
601. The present invention provides for direct movement to 
the various query windows from the history window. This is 
done in a point and click fashion. By pointing to the node 
representing the desired query window and clicking on the 
cursor control button, the desired query window can be 
displayed to the user. Other functions such as deleting query
windows can be performed from the history tree.

Techniques for graphical creation, representation and 
manipulation of tree structures are known in the art. Any 
such techniques could be implemented for use with the 
present invention. 

Visual Representation of Query Results 

The present invention provides a graphical display output 
which allows a user to visualize the results of a query to a 
database beyond a mere list format. Query results are
graphically presented as selectable disjoint subsets. Two
visualizations described below are the Venn diagram and the
perspective wall. A user may choose which visualization is
used in connection with a particular query window, or the
visualization may be selected automatically based on the
query. Alternatively, a user may wish to have the same query
results displayed using the various visualization techniques. 
Choosing which visualization to use may occur either before 
or after the query is executed. Each of these visualizations 
and a corresponding query window are now described. 

Venn Diagrams 

A Venn diagram uses circles to represent sets of data. 
Position and overlap of the circles indicate logical relationships
between the sets of data. A Venn diagram based
visualization is premised on the number of dimensions
supported by the computer controlled display system. This is
because no more than n+1 mutually intersecting sets may be
readily displayed in n-space. So for example if the computer
controlled display system generates graphical information in
an n=2 space, (i.e. two dimensions) the number of sets or
expressions that may be visualized is limited to three.
Naturally, if n=3, up to 4 expressions or sets may be utilized.
The currently preferred embodiment illustrates the case of 
n=2. However, this does not limit the number of expressions 
that can be used for a search since queries can be nested. 

Query windows embodying a Venn diagram implementation
are illustrated in FIGS. 7-9. The creation of the query
windows illustrated in FIGS. 7-9 may be done utilizing
programming tools or toolkits generally available to application
program developers. FIG. 7 illustrates a query window
prior to a query being made. Referring to FIG. 7, a
query window 701 includes a description of the search scope
702 and a plurality of query input areas 704-706. It should
be noted that each of the query input areas 704-706 has a
different visually distinctive attribute. This visually distinctive
attribute may be a color or a fill pattern. In FIG. 7, the
query input area 704 has a vertical lines fill pattern, the query
input area 705 has a horizontal lines fill pattern and the query
input area 706 has a right slanted lines fill pattern. Because
each of these queries is visually distinctive, the results of the
various expressions in a query may be readily determined.
Finally, the query window 701 includes history indicator
area 703a. The history indicator area 703a is used to indicate
a location (i.e. level) in the query history. The visual
appearance of box(es) (e.g. box 703b) displayed in the history
indicator area 703a will also be an indicator of whether the
corresponding search scope is in the current path of a query.

When querying the database, a user will enter expressions
of the query into query input areas 704-705. The query
would then be executed using a mechanism such as depressing
the enter or return key of a keyboard coupled to the
computer controlled display system, via a menu item, or via
some switch or button (e.g. an execute button invoked by a
point and click function) in the query window itself. FIG. 8
is a screen display illustrating the results of a query.
In FIG. 8, first, second and third search expressions
801-803 have been entered into query input areas 704-706,
respectively. As a result of executing the query, a textual
description of the results of the query for each of the
expressions, 804, is provided. This may include the actual
number of elements responsive to the search query.

Further included in the query window is a results display
area displaying a Venn diagram. In FIG. 8, the Venn diagram
includes circles 805-807, which correspond to the results of
search expressions 801-803, respectively. The number of
documents satisfying the search expression of a corresponding
circle is indicated by its size, by a number contained
in the circle, or both. Each of the circles may also
contain a list of, or some iconic representation of, the documents
satisfying the corresponding search expression. It is
significant to note that the resultant circles of the Venn
diagram have the same fill pattern as found in the corresponding
query input area. This allows a quick visual interpretation
of how the query expressions relate.

The disjoint selectable subsets of the venn diagram visu- 
alization are further illustrated in FIG. 8. Note that overlap 
area 808 indicates the intersection of circles 805 and 806, the 
overlap area 809 indicates the intersection of circles 805 and 
807, the overlap area 810 indicates the intersection of circles 
806 and 807, and the overlap area 811 indicates the inter- 
section of each of the circles 805-807. Each of the areas of 
the circles that do not overlap another circle are exclusive
from the other circles (i.e. there are no common results).
Each of these areas, as well as an entire circle, would 
constitute a selectable disjoint subset. 
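
The overlap areas described above can be computed directly from the per-expression result sets. The following is a minimal sketch under stated assumptions: each expression's results are modeled as a Python set of item identifiers, the function name `venn_regions` and the sample data are hypothetical, and the regions returned are the finest-grained areas (an entire circle is then the union of every region whose key contains that circle's name).

```python
from itertools import combinations


def venn_regions(results):
    """Map each non-empty combination of expression names to the items that
    satisfy exactly those expressions and none of the others."""
    names = sorted(results)
    regions = {}
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            inside = set.intersection(*(results[n] for n in combo))
            others = [results[n] for n in names if n not in combo]
            outside = set().union(*others)
            regions[combo] = inside - outside
    return regions


# Hypothetical result sets for three query expressions.
results = {"A": {1, 2, 3, 4}, "B": {3, 4, 5}, "C": {4, 6}}
for combo, items in venn_regions(results).items():
    print(combo, sorted(items))
```

With three expressions there are 2^3 - 1 = 7 such regions, some of which may be empty, as with the isolated circle 906 of FIG. 9.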

FIG. 9 illustrates the results of a different query. Referring
to FIG. 9, a set of expressions 901-903 are entered into the
query input areas 704-706 respectively. The resulting circles
904-906 of the Venn diagram correspond to the expressions
901-903. From the resulting Venn diagram in FIG. 9, it can
be readily observed that no items in the database in circle
906 are in any of the other circles. This could be a desired
or undesired result. In any event, it is clear that the relationships
of the results between the different expressions can
be quickly ascertained.

In FIG. 9 the disjoint subsets are the overlap area 907, the 
portion of circle 904 that does not overlap with circle 905, 
the entire circle 904, the portion of circle 905 that does not 
overlap with circle 904, the entire circle 905 and the entire 
circle 906. 

Selection of a new search scope is a straightforward task. 
A user would simply point and click to the disjoint subsets
in the results display to indicate that those items are included in
the new search scope. Selection of each of the overlap areas
would be for the items satisfying the logical relationship
indicated by the overlaps. Selection techniques for getting
either all of what is in a particular circle, or the part of a
circle not overlapping with another circle would be relatively
straightforward. For example, a double click in a
non-overlapping portion of a circle could be used to indicate 
selection of the entire circle, whereas a single click could 
indicate only the portion of the circle that does not overlap 
with another circle.

Perspective Wall

The perspective wall visualization permits a user to lay
out search results along a linear property, such as date or
version. The perspective wall is described in EP 0447 095A,
Robertson, et al., entitled "Workspace Display". In the
perspective wall visualization, database search results are 
organized along two user defined axes. So for example, a 
user may select date as a horizontal axis and author as a
vertical axis. Items associated with a particular author are
then laid out along the wall in time order. A query window 
for the perspective wall visualization is illustrated in FIG. 
10. Referring to FIG. 10, the results of a query are organized 
onto perspective wall 1001 in the result display area 1002 of 
query window 1003. The results of a query are mapped to 
the wall according to user provided criteria (defined below). 
Traversal along the wall is invoked by scrolling to different
parts of the wall or by "stretching" and "destretching", which
is described in EP 0447 095A. The input area 1004 is
different in that the intent of the perspective wall visualization
is to filter out portions of the search scope and then
organize along a linear property. So a first input box 1005,
labeled linear property, is for identifying a property of the
search scope along which the subsequent search results are laid
out. This can be thought of as the horizontal axis property. A
second input box 1006, labeled grouping property, is for
identifying a property of the search results by which the data
is linearly grouped. This can be thought of as the vertical axis
property. A third input box 1007 is for providing one or more
search expressions. 

FIG. 11 is an example of a perspective wall visualization. 
Referring to FIG. 11, the wall is comprised of a plurality of
results laid out on the wall in a time and author fashion. Note
that in the first input box 1005, the property "date" has been
entered and in second input box 1006, the grouping property
"author" has been entered. An expression "expression PW"
has been entered into the third input box. On the perspective
wall visualization, the documents satisfying the query criteria
for authors "Hoppe", "Rao" and "Mackinlay" are laid
out accordingly in time fashion. For the author "Hoppe"
there are three documents dated Jan. 1 (as indicated at 1101)
and two documents dated Jan. 30 (as indicated at 1102). The 
visualization indicates that there are other documents asso- 
ciated with the author "Hoppe", but the perspective wall 
must be traversed to get detail on those documents. 

For the author "Rao", a single document is dated Jan. 1 (as
indicated at 1103), two documents are dated Jan. 15 (as indicated
at 1104) and three documents are dated Feb. 15 (as
indicated at 1105). The visualization indicates that there are
other documents associated with the author "Rao", but the 
perspective wall must be traversed to get detail on those 
documents. 

For the author "Mackinlay", a single document is dated
Jan. 15 (as indicated by 1106), two documents are dated Jan.
30 (as indicated by 1107), two documents are dated Feb. 1 
(as indicated by 1108) and a single document is dated Feb. 
15 (as indicated by 1109). The visualization indicates that 
there are other documents associated with the author "Mack- 
inlay", but the perspective wall must be traversed to get 
detail on those documents. 
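
The layout in FIG. 11 amounts to grouping the results by the grouping property and ordering each group along the linear property. A minimal sketch follows, assuming documents are plain dictionaries; the function name `layout_wall`, the field names and the year 1995 are hypothetical and only the per-author counts for "Hoppe" follow the example above.

```python
from collections import defaultdict
from datetime import date


def layout_wall(documents, linear_property, grouping_property):
    """Return {group: [(linear_value, [documents]), ...]} with each group's
    columns sorted along the linear property."""
    cells = defaultdict(lambda: defaultdict(list))
    for doc in documents:
        cells[doc[grouping_property]][doc[linear_property]].append(doc)
    return {
        group: sorted(columns.items(), key=lambda item: item[0])
        for group, columns in cells.items()
    }


# Hypothetical documents mirroring the "Hoppe" row of the example.
docs = (
    [{"author": "Hoppe", "date": date(1995, 1, 30)}] * 2
    + [{"author": "Hoppe", "date": date(1995, 1, 1)}] * 3
    + [{"author": "Rao", "date": date(1995, 1, 1)}]
)
wall = layout_wall(docs, "date", "author")
for when, items in wall["Hoppe"]:
    print(when.isoformat(), len(items))
```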

For the perspective wall visualization, the disjoint subsets
would be the space between two points along the defined
linear property, or the collection of documents satisfying one
of the grouping properties, or a collection of documents
satisfying one of the grouping properties and lying between two
points along the defined linear property. Selection of a
disjoint subset to create a new search scope can be accomplished
by a point and click operation at a start point on the
wall and a point and click operation at an end point on the
wall (e.g. between two dates). Of course, the user may
traverse the wall to get to the desired start and end points. All
the search results between the two dates would then be part
of a new search scope. Selection may also be performed on
a group. A group may be selected by a point and click
operation on a label identifying that group. Finally, selection
can be performed on a group(s) but for only those documents
between two points on the wall. This may be accomplished 
by "double-clicking" on the desired group and then single 
clicking for the start and end points on the wall. 
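
The three kinds of selection just described (a range along the wall, a group, or a group restricted to a range) reduce to one filter. The sketch below is illustrative only: `select_from_wall`, its parameters and the sample documents are hypothetical, and the end points are assumed to be inclusive.

```python
from datetime import date


def select_from_wall(documents, linear_property, start=None, end=None,
                     grouping_property=None, groups=None):
    """Return the documents inside the selected part of the wall.

    start/end bound the linear property (inclusive, either may be omitted);
    groups, if given, restricts the selection to those grouping values.
    """
    selected = []
    for doc in documents:
        value = doc[linear_property]
        if start is not None and value < start:
            continue
        if end is not None and value > end:
            continue
        if groups is not None and doc[grouping_property] not in groups:
            continue
        selected.append(doc)
    return selected


# Hypothetical documents; the year is an assumption for illustration.
wall_docs = [
    {"id": 1, "author": "Rao", "date": date(1995, 1, 1)},
    {"id": 2, "author": "Rao", "date": date(1995, 1, 15)},
    {"id": 3, "author": "Hoppe", "date": date(1995, 1, 30)},
    {"id": 4, "author": "Rao", "date": date(1995, 2, 15)},
]
new_scope = select_from_wall(
    wall_docs, "date", start=date(1995, 1, 10), end=date(1995, 2, 1),
    grouping_property="author", groups={"Rao"},
)
print([doc["id"] for doc in new_scope])
```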

Thus, a computer controlled display system providing for 
graphical representation of a query to a database and creation
and traversal through a search history is disclosed.
While the embodiments disclosed herein are preferred, it
will be appreciated from this teaching that various alternatives,
modifications, variations or improvements therein may be 
made by those skilled in the art, which are intended to be 
encompassed by the following claims. 

What is claimed: 

1. A computer controlled display system for displaying 
the results of a search for documents stored in a database, 
said computer controlled display system comprising: 

a display for displaying a plurality of query windows; 

means for defining a search scope; 

means for generating a query window responsive to 
definition of a search scope, said query window com- 
prising an input area for input of query expressions, a 
query results area for graphical display of query results, 
a history indicator area for displaying one or more 
search scope indicators, and a search scope area for 
indicating the search scope for the query window, said 
means for generating a query window coupled to said
display; 

means for entering a query to said database, said query
comprised of one or more expressions;

query processing means for processing query expressions
from said input areas of said query window and causing
display of the results of said query according to a user
selected information visualization technique, said user
selected information visualization technique causing
display of a set of query results as a plurality of
selectable disjoint subsets in said query results area,
said query processing means coupled to said means for
generating a query window.

2. The computer controlled display system as recited in 
claim 1 wherein said one or more search scope indicators of 
said history indicator area of said query window represents
a search scope at a level in said search path.

3. The computer controlled display system as recited in 
claim 2 wherein one of said one or more search scope 
indicators of said history indicator area is highlighted to 
indicate the search scope being displayed.

4. The computer controlled display system as recited in
claim 2 wherein two or more of said search scope indicators
of said history indicator area are highlighted to indicate
search scopes being displayed.

5. The computer controlled display system as recited in
claim 2 wherein each of said one or more search scope
indicators of said history indicator area is an icon.

6. The computer controlled display system as recited in
claim 2 wherein each of said one or more search scope
indicators of said history indicator area is one or more text
symbols. 

7. The computer controlled display system as recited in 
claim 1 wherein said information visualization technique is
a Venn diagram wherein each circle corresponds to one of
said one or more expressions of said query and the spatial
locations and overlaps of said circles define a plurality of
selectable areas.

8. The computer controlled display system as recited in
claim 7 wherein said means for defining a search scope is
comprised of means for selecting one or more areas in said
Venn diagram.

9. The computer controlled display system as recited in 
claim 1 wherein said information visualization technique is 
a perspective wall.

10. The computer controlled display system as recited in
claim 9 wherein said means for defining a search scope is 
comprised of means for selecting a first point on said wall 
and a second point on said wall. 

11. The computer controlled display system as recited in 
claim 1 further comprising means for displaying a search
path, said search path comprised of a plurality of query
windows which are concentrically displayed so that the
history indicator area for each query window is displayed.

12. The computer controlled display system as recited in
claim 1 wherein said display means is further for displaying
a history window, said history window for displaying a
history structure of a search history of said search for
documents stored in said database.

13. The computer controlled display system as recited in 
claim 12 further comprising means for adding a node to said 
history structure in response to definition of a search scope. 

14. The computer controlled display system as recited in 
claim 1 wherein said means for defining a search scope is 
comprised of means for selecting one or more of said 
plurality of selectable disjoint subsets in said query result 
area. 

15. On a computer system having a display means and 
coupled to a database, a method for displaying the results of 
a query to said database on said display means, said method 
comprising the steps of:

a) displaying a first query window having a first search
scope, said first query window having a query entry
area, a results display area and a history indicator area; 

b) executing a provided query from a user, said query 
comprised of a plurality of expressions entered into
said query entry area of said first query window; 

c) displaying an information visualization of the set of 
results of said query in said results display area of said 
first query window, said information visualization dis- 
playing a plurality of selectable disjoint subsets; 

d) determining that said user has selected one or more of 
said selectable disjoint subsets; 

e) creating a second query window having a second 
search scope comprised of said selected one or more 
disjoint subsets, said query window having a query 
entry area, a results display area and a history indicator 
area; and 

f) displaying said second query window concentric with 
said first query window so that the history indicator 
area of said first query window and the history indicator 
area of said second query window are concurrently 
displayed. 

16. The method as recited in claim 15 wherein concurrent
with step a), performing the step of creating a first node in 
a hierarchical history structure. 

17. The method as recited in claim 16 wherein concurrent 
with step e), performing the step of creating a second node 
in a hierarchical history structure. 

18. The method as recited in claim 17 wherein said 
method is further comprised of the steps: 

g) detecting that said user has requested viewing of a 
history window, said history window for displaying 
said hierarchical history structure; and 

h) displaying said history structure in a history window.

SYSTEM FOR CATEGORIZING
DOCUMENTS IN A LINKED COLLECTION 
OF DOCUMENTS 

CROSS REFERENCE TO RELATED 
APPLICATIONS 

The present application is related to commonly assigned 
U.S. patent application Ser. No. 08/836,807 entitled "System
For Predicting Documents Relevant To Focus Documents
By Spreading Activation Through Network Representations
Of A Linked Collection Of Documents", U.S. Pat. No.
5,835,905, which was filed concurrently with the present
application. 

FIELD OF THE INVENTION 

The present invention is related to the field of analysis and 
design of linked collections of documents, and in particular 
to categorization of documents in said collection. 

BACKGROUND OF THE INVENTION 

Users of large linked collections of documents, for 
instance as manifest on the World Wide Web, are motivated 
to improve the rate at which they gain information needed to 
accomplish their goals. Hypertext structures primarily
afford information seeking by the sluggish process of
browsing from one document to another along hypertext
links. This sluggishness can be at least partly attributed to
three sources of inefficiency in the basic process. First, basic
hypertext browsing entails slow sequential search by a user
through a document collection. Second, important information
about the kinds of documents and content contained in
the total collection cannot be immediately and simultaneously
obtained by the user in order to assess the global
nature of the collection or to aid in decisions about what
documents to pursue. Third, the order of encounter with
documents in basic browsing is not optimized to satisfy
users' information needs. In addition to exacerbating difficulties
in simple information-seeking, these problems may
also be found in the production and maintenance of large
hypertext collections. 

There are two widely visible technologies that may be 
considered broadly as seeking to address the above ineffi- 
ciencies: 

Text-based information retrieval techniques that rapidly 
evaluate the predicted relevance of documents to a 
user's topical query (e.g. services such as Alta Vista™,
Lycos™ and Infoseek™ which operate on the World
Wide Web). This effectively changes slow sequential
search to nearly parallel search, and provides an
improved ordering of the users' search through documents.

Community/service categorization of documents. For 
instance, this service is provided by Yahoo™, which 
has a hierarchy of Web pages that define a topic 
taxonomy. 

Known previous work has focused on attempts to extract 
higher level abstractions which can be used to improve 
navigation and assimilation of hypertext. Such work has 
typically used topological or textual relationships to drive 
analysis.

SUMMARY OF THE INVENTION

A system for analyzing the topology, content and usage of
linked collections of documents such as those found on the
World Wide Web (hereinafter the Web) to facilitate information
[...] of a web locality is
disclosed. Documents found on the Web are typically
referred to as Web pages. The system provides for
categorization based on feature vectors that [...]
and prediction of need (or
relevance) of other web pages with respect to a particular
context, which could be a particular page or set of pages.
These techniques provide (from the user's perspective) nearly-parallel
search, simultaneous identification of the types of all documents
in a collection, and prediction of expected need. These
techniques may be used in support of various information
visualization techniques, such as the WebBook described in
[...] Ser. No. [...]
entitled "Display System For Displaying Lists
of Linked Documents", to form and present larger aggregates
of related Web pages. Categorization techniques are
based on representations of Web pages as feature vectors
containing information about document content, usage, and
topology, as well as content, usage, and topology relations to
other documents. These feature vectors are used to identify
and rank particular kinds of Web pages, such as "organization
home pages" or "index pages".

Spreading activation techniques are based on representations
of Web pages as nodes in graph networks representing
usage, content and hypertext relations among Web pages.
Conceptually, activation is pumped into one or more of the
graph networks at nodes representing some starting set of Web
pages (i.e. focal points) and it flows through the arcs of
the graph structure, with the amount of flow modulated by
the arc strengths (which might also be thought of as arc flow
capacities). The asymptotic pattern of activation over nodes
will define the degree of predicted relevance of Web pages
to the starting set of Web pages. By selecting the topmost
active nodes or those above some set criterion value, Web
pages may be aggregated and/or ranked based on their
predicted relevance.

BRIEF DESCRIPTION OF THE DRAWINGS 
FIG. 1 is a flowchart illustrating the basic steps for web 
page categorization and relevance prediction as may be 
performed in the currently preferred embodiment of the 
present invention. 

FIG. 2 is a flowchart illustrating the steps for obtaining the 
topology and meta-information for a web locality as may be
performed in the currently preferred embodiment of the 
present invention. 

FIG. 3 is a flowchart illustrating the steps for obtaining 
usage statistics, usage path and entry point information as 
may be performed in the currently preferred embodiment of
the present invention.

FIG. 4 is a flowchart for calculating a text similarity 
matrix as may be performed in the currently preferred 
embodiment of the present invention.

FIG. 5 is an illustration of a feature vector as may be 
utilized in the currently preferred embodiment of the present 
invention. 

FIG. 6 is a table showing examples of categories and the
corresponding feature weightings for the categories as may
be utilized in the currently preferred embodiment of the
present invention.

FIG. 7 is a diagram illustrating the concept of spreading 
activation, as may be utilized in the currently preferred 
embodiment of the present invention. 

FIG. 8 is an illustration of a topology network for a Web 
locality.

FIG. 9 is an illustration of a matrix representation of the
topology network of FIG. 8.

FIG. 10 is an illustration of a text similarity network for
a Web locality.

FIG. 11 is an illustration of a matrix representation of the
text similarity network of FIG. 10.

FIG. 12 is an illustration of a usage path network for a
Web locality.

FIG. 13 is an illustration of a matrix representation of the
usage path network of FIG. 12.

FIG. 14 is a block diagram illustrating the basic compo- 
nents of a computer based system as may be used to 
implement the currently preferred embodiment of the 
present invention. 

DETAILED DESCRIPTION OF THE
INVENTION 

A system for analyzing the topology, content and usage of 
collections of linked documents is disclosed. The informa- 
tion derived from such a system may be used to aid a user 
in browsing the collection, redesigning the organization of
the collection or in creating visualizations of the collections.
The system provides a means for automatically categorizing
the pages in the collection and a means for predicting the
relevance of other pages in a collection with respect to a
particular Web page using a spreading activation technique.

The currently preferred embodiment of the present invention
is implemented for analyzing collections of linked
documents residing on the portion of the Internet known as
the World Wide Web (hereinafter the Web). However, it
should be noted that the present invention is not limited to
use on the Web and may be utilized in any system which 
provides access to linked entities, including documents, 
images, videos, audio, etc. The following terms defined 
herein are familiar to users of the Web and take on these 
familiar meanings: 

World-Wide Web or Web: The portion of the Internet that is 
used to store and access linked documents. 
Web Page or Page: A document accessible on the Web. A
Page may have multi-media content as well as relative and
absolute links to other pages.

Web Locality: A collection of related Web pages associated
with an entity having a site on the World-Wide Web such as
a company, educational institute or [...].
Topology: The logical organization of web pages at a web
locality as defined by links contained in the individual web
pages.

Home Page: A page functioning as an entry point to a set of
related pages on the Web. A home page will typically have
a plurality of relative links to related pages.
Uniform Resource Locator (URL): The address or identifier
for a page on the Web.

Server: An addressable storage device residing on the Internet
which stores Web Pages.
Link: An indicator on a Web page which refers to another
Web page and which can typically be retrieved in a point and
click fashion. The Link will specify the URL of the other
Web page.

Web Browser or Browser: A tool which enables a user to
traverse through and view documents residing on the Web.
Other rendering means associated with the Browser will
permit listening to audio portions of a document or viewing
video or image portions of a document.
Meta-information: Characteristic information for a particular
Web page, including name, file size, number of links to
pages in the Web locality, number of links to pages outside
of the Web locality, depth of children, similarity to children,
etc.

Overview

To best understand the context of the present invention, 
assume a scenario in which a user searches for relevant, 
valuable information at some web locality. The optimal 
selection of Web pages from the web locality to satisfy a 
user's information needs depends, in part, on the user's 
ability to rapidly categorize the Web page types, assess their 
prevalence on the web locality, assess their profitabilities 
(amount of value over cost of pursuit), and decide which 
categories to pursue and which to ignore. The overall rate of 
gaining useful information will be improved by eliminating
irrelevant or low-value categories of information from consideration.
Simply put, a user's precious time and attention
benefit by being able to rapidly distinguish junk categories
from important ones. This is improved by the degree to 
which Web pages can be quickly and simultaneously cat- 
egorized. 

Memory systems, whether human or machine, serve the
purpose of providing useful information when it is needed.
In part, the design of such systems is adaptive to the extent
that they can reduce the costs of retrieving the information
that is likely to be needed in a given context. This, for
instance, is what memory caches and virtual memory
attempt to optimize. For contexts involving human
cognition, it has been argued that three general sorts of
information determine the need probabilities of information
in memory, given a current focus of attention: (1) past usage
patterns, (2) degree of content shared with the focus, and (3)
inter-memory associative link structures. The Web can be
viewed as an external memory and a user would be aided by
retrieval mechanisms that predicted and returned the most
likely needed Web pages, given that the user has indicated
an interest in a particular Web page in the Web locality. 

In the present invention a kind of spreading activation
mechanism is used to predict the needed Web page(s),
computed using past usage patterns, degree of shared 
content and the Web topology. The present invention uti- 
lizes techniques for inducing such information, and for 
approximating the computation of need probabilities using 
spreading activation. Also described is a way of pre- 
computing a base set of spreading activation patterns from 
which all possible patterns can be computed in a simple and 
efficient way (whose cost is proportional only to the number 
of activation sources involved in a retrieval). 
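
The idea of pre-computing a base set of patterns can be illustrated with a small sketch. The update rule used here (each node's activation is its source input plus a damped, strength-weighted sum of its neighbours' activation) is an assumption chosen for illustration and is not taken from this text; the function names, the `flow` damping parameter and the strength values are likewise hypothetical. Because the rule is linear in the source input, the pattern for several sources equals the sum of the single-source base patterns, so combining them costs work proportional to the number of sources.

```python
def spread_activation(strengths, sources, flow=0.5, steps=50):
    """Iterate activation over a network.

    strengths[i][j] is the arc strength for flow from node j to node i;
    sources[i] is the activation pumped into node i on every step.
    """
    n = len(strengths)
    activation = list(sources)
    for _ in range(steps):
        activation = [
            sources[i]
            + flow * sum(strengths[i][j] * activation[j] for j in range(n))
            for i in range(n)
        ]
    return activation


def base_patterns(strengths, flow=0.5, steps=50):
    """Pre-compute one activation pattern per single source node."""
    n = len(strengths)
    return [
        spread_activation(
            strengths, [1.0 if k == i else 0.0 for k in range(n)], flow, steps
        )
        for i in range(n)
    ]


def combine(patterns, source_nodes):
    """Build the pattern for several sources by summing base patterns."""
    n = len(patterns[0])
    return [sum(patterns[s][i] for s in source_nodes) for i in range(n)]


def rank(activation):
    """Node indices ordered from most to least active."""
    return sorted(range(len(activation)), key=lambda i: -activation[i])


# Hypothetical three-page network with symmetric arc strengths.
strengths = [
    [0.0, 0.5, 0.2],
    [0.5, 0.0, 0.3],
    [0.2, 0.3, 0.0],
]
base = base_patterns(strengths)
print(rank(base[0]))           # pages ordered by predicted relevance to page 0
print(combine(base, [0, 2]))   # pattern for focus pages 0 and 2 together
```

The strengths are kept small enough that the iteration settles; a network with stronger arcs would need a smaller `flow` value for the same behaviour.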

The basic steps for categorizing web pages in a web 
locality and for predicting relevance of other pages to a
selected page as may be performed in the currently preferred
embodiment of the present invention are briefly described
with reference to the flowchart in FIG. 1. First, raw data is
gathered for the web locality, step 101. Such raw data may
be obtained from usage records or access logs of the web
locality and by direct traversal of the Web pages in the Web
locality. As described below, "Agents" are used to collect
such raw data. However, it should be noted that the 
described agents are not the only possible method for 
obtaining the raw data for the basic feature vectors. It is 
anticipated that Internet service providers have the capabili- 
ties to provide such raw data and may do so in the future. 

In any event, the raw data is then processed into desired formats
for performing the categorization (feature vectors) and relevance
prediction (topology, usage path and text similarity maps), step
102. The raw data is comprised of topology information, page
meta-information, usage frequency and path information, and text
similarity information. Topology information describes the
hyperlink structure among Web pages at a Web locality. Page
meta-information defines various features of the pages, such as
file size and URL. Usage frequency and path information indicate
how many times a Web page has been accessed and how many times a
traversal was made from one Web page to another. Text similarity
information provides an indication of the similarity of text among
all text Web pages at a Web locality. For the classification of Web
pages in the web locality, classification characteristics are
provided, step 103. The classification characteristics are
predetermined "rules" which are applied to the feature vectors of a
page to determine the category of the page. For example, it may be
desirable to have a classification of web pages as index types
(contain primarily links to other pages) or content types (contain
primarily information). The classification characteristics [...]

As noted above with respect to step 102, topology, usage path and
text similarity maps of the web locality are generated from the raw
data. These maps represent the strength of association among web
pages in the locality. The topology map indicates the hyperlink
structure of the web locality. The usage path map indicates the
flow or paths taken during traversal of the web locality. The text
similarity map indicates similarity of content between pages in the
web locality. These maps are used to perform the relevancy
predictions.

For relevancy predictions, one or more Web pages for spreading
activation are selected, step 105. A Web page may be selected based
on the category that it is in. Alternatively, if a user is
currently browsing the pages in the web locality, the selected page
may be the one currently being browsed. In any event, activation is
spread using the selected page as a focal point to generate a list
of relevant pages, step 106. Generally, activation is pumped into
one or more of the maps at the selected Web pages and it flows
through the arcs of the maps, with the amount of flow modulated by
the arc strengths (which might also be thought of as arc flow
capacities). The activation results are reviewed to find relevant
pages, step 107. The asymptotic pattern of activation over nodes in
the maps (i.e., Web pages) will define the degree of predicted
relevance of Web pages to the selected set of Web pages. By
selecting the topmost active Web pages or those above some set
criterion, Web pages may be ranked based on their predicted
relevance. Subsequent traversal may then be performed based on the
identified relevant Web pages.

Compiling the Raw Data for a Web Locality

Three basic kinds of raw data are extracted from a Web locality:

Topology and meta-information, which are the hyperlink
   structure among Web pages at a Web locality and various
   features of the pages, such as file size and URL.
Usage frequency and usage paths, which indicate how many times
   a Web page has been accessed and how many times a traversal
   was made from one Web page to another.
Text similarity among all text Web pages at a Web locality.

As described above with respect to FIG. 1, the raw data is used to
construct two types of representations:

Feature-vector representations of each Web page that represent
   the value of each page on each dimension and which are used
   in the categorization process.
Graph representations of the strength of association of Web
   pages to one another, which are used in the relevance
   prediction.

Topology and Meta-information

The site's topology is ascertained via "the walker", an autonomous
agent that, given a starting point, performs an exhaustive
breadth-first traversal of pages within the locality. FIG. 2 is a
flowchart illustrating the steps performed by the walker. Referring
to FIG. 2, the walker uses the Hypertext Transfer Protocol (HTTP)
to request and retrieve a web page, step 201. The walker may also
be able to access [...]

The returned page is then parsed to extract hyperlinks to other
pages, step 202. Links that point to pages within the locality are
added to a list of pages to request and retrieve. [...] the time
the page was last modified. The page is then added to a topology
matrix [...] a set of meta-information [...]. The list of pages to
request and retrieve is then used to obtain [...] all of the pages
on the list.

The walker produces a graph representation of the hyperlink
structure of the Web locality, with each node having at least the
above described meta-information. It is salient to note that the
walker may not have reached all nodes that are accessible via a
particular server; only nodes that were reachable from the starting
point (e.g. a Home Page for the Web locality) are included. This
can be alleviated by walking [...]

Usage Statistics, Usage Paths, and Entry Points

Most servers have the ability to record transactional information,
i.e. access logs, about requested items. This information usually
consists of at least the time and the name of the URL being
requested as well as the machine name making the request. The
latter field may represent only one user making requests from their
local machine, or it could represent a number of users whose
requests are being issued through one machine, as is the case with
firewalls and proxies. This makes differentiating the paths
traversed by individual users from these access logs non-trivial,
since numerous requests from proxied and firewalled domains can
occur simultaneously. That is, if 200 users from behind a proxy are
simultaneously navigating the pages within a site, how does one
determine which users took which paths? The problem is further
complicated by local caches maintained by each browser and
intentional reloading of pages by the user.

The technique implemented to determine user's paths, a.k.a. "the
whittler", utilizes the Web locality's topology along with several
heuristics. FIG. 3 is a flowchart illustrating the steps performed
to determine user paths. First, a user path is obtained from the
web locality access logs, step 301. The topology matrix is
consulted to determine legitimate traversals. It is then determined
if there are any ambiguities with respect to the user path, step
302. As described above, such ambiguities may arise in the
situation where the request is from a proxied or firewalled domain.
If an ambiguity is suspected, predetermined heuristics are used to
disambiguate user paths, step 303. The heuristics used rely upon a
least recently used bin packing strategy and session length
time-outs as determined empirically from end-user navigation
patterns. Essentially, new paths are created for a machine name
when the time between the last request and the current request was
greater than the session boundary limit, i.e. the session timed
out. New paths are also created when the requested page is not
connected to the last page in the currently maintained path. These
tests are performed on all paths being maintained for that machine
name, with the ordering of tests being the paths least recently
extended. The foregoing analysis produces a set of paths requested
by each machine and the times for each request.
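The path heuristics just described can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: the function name `whittle`, the log tuple layout, the representation of the topology as a set of `(from_page, to_page)` pairs, and the 30-minute session limit are all assumptions made for the example.

```python
from collections import defaultdict

SESSION_TIMEOUT = 30 * 60  # seconds; an assumed session boundary limit


def whittle(requests, links, timeout=SESSION_TIMEOUT):
    """Split a time-ordered access log into per-machine paths.

    requests: iterable of (timestamp_seconds, machine_name, page)
    links:    set of (from_page, to_page) hyperlinks (the topology)
    Returns {machine_name: [path, ...]}, each path being a list of
    (timestamp, page) pairs.
    """
    open_paths = defaultdict(list)    # paths still eligible for extension
    closed_paths = defaultdict(list)  # paths whose session has timed out

    for when, machine, page in requests:
        # Retire paths whose last request is older than the session limit.
        still_open = []
        for path in open_paths[machine]:
            if when - path[-1][0] > timeout:
                closed_paths[machine].append(path)
            else:
                still_open.append(path)
        # Test the least recently extended paths first.
        still_open.sort(key=lambda p: p[-1][0])
        for path in still_open:
            if (path[-1][1], page) in links:
                path.append((when, page))
                break
        else:
            # No open path links to this page: start a new path.
            still_open.append([(when, page)])
        open_paths[machine] = still_open

    machines = set(open_paths) | set(closed_paths)
    return {m: closed_paths[m] + open_paths[m] for m in machines}
```

Each machine name keeps several open paths at once, which is what allows requests from many users behind one proxy to be separated whenever the topology rules out a traversal.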

From the set of paths, a vector that contains each page's frequency
of requests is generated (i.e. a frequency vector), step 304, along
with a path matrix containing the number of traversals from one
page to another, step 305. In the currently preferred embodiment
the matrix is computed using software that identifies the frequency
of all substring combinations for all paths.
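As a rough illustration of these two structures, the sketch below counts requests per page and direct traversals between pages. It is a simplification: only adjacent page pairs (the length-2 substrings of each path) are counted, whereas the text describes counting all substring combinations. The function name `usage_counts` and the dictionary-based storage are assumptions.

```python
from collections import Counter


def usage_counts(paths):
    """Return (frequency, traversals) for a list of page paths.

    frequency[page]     -> number of requests for that page
    traversals[(a, b)]  -> number of direct traversals from a to b
    """
    frequency = Counter()
    traversals = Counter()
    for path in paths:
        frequency.update(path)
        # Adjacent pairs are the length-2 substrings of the path.
        traversals.update(zip(path, path[1:]))
    return frequency, traversals
```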

Additionally, the difference between the total number of requests
for a page and the sum of the paths to the page is computed to
generate a set of entry point candidates, step 306. The entry point
candidates are the Web pages at a Web locality that seem to be the
starting points for many users. Entry points are defined as the set
of pages that are pointed to by sources outside the locality, e.g.
an organization's home page, a popular news article, etc. Entry
points might provide useful insight to Web designers based on
actual use, which may differ from their intended use on a Web
locality. Entry points also may be used in providing a set of nodes
from which to spread activation.
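The entry point computation amounts to a subtraction per page, sketched below. The function name `entry_point_candidates`, the dictionary inputs, and the choice to rank by surplus and to drop pages with no surplus are assumptions made for illustration.

```python
def entry_point_candidates(frequency, traversals):
    """Rank pages by requests not explained by an in-locality traversal.

    frequency:  {page: total number of requests}
    traversals: {(from_page, to_page): number of traversals}
    """
    inbound = {}
    for (_, dest), count in traversals.items():
        inbound[dest] = inbound.get(dest, 0) + count
    surplus = {page: total - inbound.get(page, 0)
               for page, total in frequency.items()}
    # Largest surplus first; pages with no surplus are not candidates.
    ranked = sorted(surplus.items(), key=lambda item: (-item[1], item[0]))
    return [(page, extra) for page, extra in ranked if extra > 0]
```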
Inter-document Text Similarity

Techniques from information retrieval can be applied to calculate a
text similarity matrix which represents the inter-document text
similarities among Web pages. In particular, for each Web page, the
text is tokenized and indexed using a statistical content analysis
(SCA) process. An SCA engine processes text Web pages by treating
their contents as a sequence of tokens and garnering collection and
document level object and token statistics (most notably token
occurrence). A contiguous character string representing a word is
an example of a token. So in the currently preferred embodiment of
the present invention, the Web pages in a Web locality are
processed by the SCA engine to yield various indexes and index
terms. A suitable process for analysis and tokenization of a
collection of documents (or database) is described in section 5 of
a publication entitled "An Object-Oriented Architecture for Text
Retrieval", Doug Cutting, Jan Pedersen, and Per-Kristian Halvorsen,
Xerox PARC technical report SSL-90-83.

FIG. 4 is a flowchart describing the steps for generating a text
similarity matrix. Referring to FIG. 4, a suitable SCA engine is
used to tokenize a web page, step 401. Token statistics for the web
page are then generated, step 402. These statistics include token
occurrence. The token information is then used to create a document
vector, where each component of the vector represents a word, step
403. Entries in the vector for a document indicate the presence or
frequency of a word in the document. The steps 401-403 are repeated
for each Web page in the Web locality. For each pair of pages, the
dot product of these vectors is computed, step 404, which yields a
similarity measure. The similarity measure is then entered into the
appropriate location of the text similarity matrix for the Web
locality, step 405.
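The steps of FIG. 4 can be sketched in a few lines. This is an illustration under stated assumptions, not the SCA engine itself: the regular-expression tokenizer, the use of raw token counts as vector entries, the zero diagonal, and the function names are all choices made for the example.

```python
import re
from collections import Counter


def document_vector(text):
    """Token-frequency vector; a token here is a run of letters/digits."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def dot(u, v):
    """Dot product of two sparse vectors stored as Counters."""
    if len(u) > len(v):
        u, v = v, u
    return sum(count * v[token] for token, count in u.items())


def text_similarity_matrix(pages):
    """pages: {page_id: text}. Returns (ids, matrix), where matrix is
    symmetric and its diagonal is left at zero."""
    ids = sorted(pages)
    vectors = [document_vector(pages[p]) for p in ids]
    n = len(ids)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = dot(vectors[i], vectors[j])
    return ids, matrix
```

Raw dot products favor long pages; a length-normalized measure such as cosine similarity is a common alternative, though the text here specifies only the dot product.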

The currently preferred embodiment further provides a method for
computing a "desirability" index for each Web page that "ages" over
time. Using this, one can predict the number of hits a page will
receive. What may also be provided is a "life-change" index, that
also "ages" over time, that predicts the likelihood of Web pages
being altered.

Categorization of Web Pages

In order to perform categorizations, each Web page at the Web
locality is represented by a vector of features constructed from
the above topology, meta-information, usage statistics and paths,
and text similarities. These Web page vectors are collected into a
matrix. Such a matrix is illustrated in FIG. 5. Referring to FIG.
5, each row 501 of the matrix 500 represents a Web page. The
columns in matrix 500 represent the page's:

page identifier, identifies the particular web page (column 502).
size, in bytes, of the item (column 503).
inlinks, the number of hyperlinks that point to the item from the
   web locality (column 504).
outlinks, the number of hyperlinks the item contains that point to
   other items in the web locality (column 505).
frequency, the number of times the item was requested in the
   sample period (column 506).
sources, the number of times the item was identified as the start
   of a path traversal (column 507).
csim, the textual similarity of the item to its children based
   upon previous SCA calculation (column 508).
cdepth, the average depth of the item's children as measured by
   the number of "/" in the URL (column 509).
Note that the means and distributions of the feature values are
normalized.
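The text does not say how the feature values are normalized. One common choice, shown here purely as an assumption, is to standardize each column to zero mean and unit variance; the function name `normalize_columns` is hypothetical.

```python
from statistics import mean, pstdev


def normalize_columns(rows):
    """Standardize each feature column to mean 0 and unit variance.

    rows: list of equal-length lists of raw feature values
    (one row per page). Constant columns are mapped to 0.0.
    """
    columns = list(zip(*rows))
    centers = [mean(col) for col in columns]
    spreads = [pstdev(col) for col in columns]
    return [
        [(value - c) / s if s else 0.0
         for value, c, s in zip(row, centers, spreads)]
        for row in rows
    ]
```

Putting all features on a comparable scale keeps a large-valued feature such as size in bytes from dominating a weighted sum of features.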
The present invention assumes that categories are designed by
someone (application designer, webmaster, end user), in contrast to
being automatically induced. These categories might be, for
instance, socially defined genres (personal home page; product
description), or personally defined categories of interest.

The present invention utilizes an approach based on weighted linear
equations that define the rules for predicting degree of category
membership for each page at a web locality. That is, equations are
of the form

\[ c_i = \sum_j w_j v_{ij} \]

for all pages i in a Web locality, where the v_{ij} are the
measured features of each Web page, the w_j are weights, and c_i
denotes the resulting degree of category membership of page i.
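A weighted linear rule of this kind takes only a few lines to evaluate. In the sketch below the function names, the dictionary layout, and the example weights are illustrative assumptions; the actual weights are those tabulated in FIG. 6.

```python
def category_score(features, weights):
    """Weighted linear score: the sum over j of w_j * v_j."""
    return sum(weights.get(name, 0.0) * value
               for name, value in features.items())


def rank_by_category(pages, weights):
    """Order page ids from most to least typical of the category.

    pages: {page_id: {feature_name: normalized value}}
    """
    return sorted(pages, key=lambda p: category_score(pages[p], weights),
                  reverse=True)
```

For example, with hypothetical weights `{"size": 1, "inlinks": -1, "outlinks": -1}`, a large page with few links scores higher than a small page with many links.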
Example of Categories

Categorization techniques typically attempt to assign individual
elements into categories based on the features they exhibit. Based
on category membership, a user may quickly predict the
functionality of an element. For instance, in the everyday world,
identifying something as a "chair" enables the quick prediction
that an object can be sat on. The techniques described herein will
thus rely on the particular features that can be extracted about
Web pages at a Web locality.

One may conceive of a Web locality as a complex abstract space in
which are arranged Web pages of different functional categories or
types. These functional categories might be defined by a user's
specific set of interests, or the categories might be extracted
from the collection itself through inductive technologies (e.g.
Scatter/Gather techniques as described by Cutting, et al. in a
publication entitled "Scatter/Gather: A cluster-based approach to
browsing large document collections", Proceedings of SIGIR '92,
Jun. 1992). An example category might be organizational home page.
Typical members of the category would describe an organization and
have links to many other Web pages, providing relevant information
about the organization, its divisions or departments, summaries of
its purpose, and so on.
In the currently preferred embodiment a set of functional
categories is defined. Each functional category was defined in a
manner that has a graded membership, with some pages being more
typical of a category than others, and Web pages may belong to many
categories. FIG. 6 is a table illustrating the Web categories
defined in the currently preferred embodiment of the present
invention:
head 601: Typically a related set of pages will have one page that
   would best serve as the first one to visit. Head pages have two
   subclasses:
      organizational home page 602: These are pages that represent
         the entry point for organizations and institutions,
         usually found as the default home page for servers, e.g.,
         http://www.org/
      personal home page 603: Usually, individuals have only one
         page within an organization that they place personal
         information and other tidbits on.
index 604: These are pages that serve to navigate users to a number
   of other pages that may or may not be related. Typically pages
   in this category have the words "Index" or "Table of Contents"
   or "toc" as part of their URL.
source index 605: These pages are also head nodes, those that are
   used as entry points and indices into a related information
   space.
reference 606: A page that is used to repeatedly explain a concept
   or contains actual references. References also have a special
   subclass:
      destination reference 607: In graph theory these are best
         thought of as "sinks", pages that do not point elsewhere
         but that a number of other pages point to. Examples
         include pages of expanded acronyms, copyright notices, and
         bibliographic references.
content 608: These are pages whose purpose is not to facilitate
   navigation, but to deliver information.

FIG. 6 further shows the weights used to order Web pages for each
of the categories outlined above. For example, it is hypothesized
that Content Pages would have few inlinks and few outlinks, but
have relatively larger file sizes. [...] For Head Nodes
(classification criteria 601), being the first [...] collection of
documents with like content, it is hypothesized that such pages
will have high text similarity between itself and its children,
would have a high average depth of its children, and would be more
likely to be an entry point based upon actual user navigation
patterns.

It is noted that sometimes categories are formed which cannot be
captured by such rules (i.e. the rules assume linearly separable
categories and people sometimes form categories that are not
linearly separable). However, the approach of the currently
preferred embodiment has the advantage of being easy to compute and
having simple combinatorics. This means that (a) the rules could be
easily defined by the average end-user, (b) membership in all core
categories can be precomputed and stored as another feature on the
feature vector (a computed feature as opposed to a basic feature),
and (c) membership in a mixture of categories is just another
weighted linear equation in which the features are categories.

Relevance Prediction Through Spreading Activation

With the above information, various predictions can be made as to
pages relevant to a particular page. The "spreading activation"
technique is used to make such a prediction. Spreading activation
can be characterized as a process that [...]

As noted above, the raw data provided by the web agents is
processed into structures representing the (a) link topology, (b)
usage flow, and (c) inter-page text similarity. The spreading
activation technique used for relevance prediction assumes that one
may identify a pattern of input activation that represents a
pattern or focus of attention. For instance, the focus may be a
specific Web page or a prototype of a category. Activation from
this focus point(s) spreads through one or more of the three graphs
and eventually settles into a stable pattern of activation across
all nodes. The activation values are assumed to be the predicted
relevance to the input focus (or the probability that a page will
be needed given the pages in the input focus).

Spreading activation across the networks is described conceptually
with reference to FIG. 7. Referring to FIG. 7, activation is pumped
into one or more of the graph networks 702 at nodes representing
some starting set of focus Web pages. The activation flows through
the arcs of the graph structure, with the amount of flow modulated
by the arc strengths (which might also be thought of as arc flow
capacities). The asymptotic pattern of activation over nodes, as
illustrated [...] in the nodes, will define the degree of predicted
relevance of Web pages to the starting set of focus Web pages. By
selecting the topmost active nodes or those above some set
criterion value, Web pages are extracted and ranked based on their
predicted relevance.

The particular technique described has the property that the
activation patterns that result from multiple input sources are
just additive combinations of the activation patterns produced by
each of the sources individually (multiple weighted sources are
just weighted additions of the individual sources). Using this
property, one may precompute the activation patterns that arise
from each source combined with each graph. All complex patterns can
be derived from these by simple vector addition. In addition, the
activation values arising in each activation pattern can be
combined with [...]

In the currently preferred embodiment the activation spreading
technique is based on a leaky capacitor model described by J. R.
Anderson and P. L. Pirolli in "Spread of Activation", Journal of
Experimental Psychology: Learning, Memory and Cognition, pp.
791-798 (1984), and by B. A. Huberman and T. Hogg in "Phase
Transitions In Artificial Intelligence Systems", Artificial
Intelligence, 33, pp. 155-171 (1987).
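A minimal sketch of such a spreading activation computation follows. It assumes the leaky-capacitor update A(t) = C + M A(t-1) with M = (1 - γ)I + αR, where R holds the arc strengths, C the activation pumped in at each node, γ the decay and α the spread. The function name `spread`, the default parameter values, the zero starting state, and the plain-list matrix layout are illustrative assumptions.

```python
def spread(R, C, alpha=0.1, gamma=0.5, steps=10):
    """Iterate A(t) = C + M A(t-1), with M = (1 - gamma) I + alpha R.

    R[j][i] is the strength of the arc carrying activation from
    node i to node j; C[i] is the activation pumped in at node i.
    Activation starts at zero on every node.
    """
    n = len(C)
    A = [0.0] * n
    for _ in range(steps):
        A = [
            C[j] + (1 - gamma) * A[j]
            + alpha * sum(R[j][i] * A[i] for i in range(n))
            for j in range(n)
        ]
    return A
```

Because the update is linear in C, the pattern produced by several weighted sources equals the same weighted sum of the patterns produced by each source alone. That is the property that makes it possible to precompute one pattern per source node and combine them later by vector addition.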

Networks for Spreading Activation

As outlined above, three kinds of graphs, or networks, are used to
represent strength of associations among Web pages: (1) the
hypertext link topology of a Web locality, (2) inter-page text
similarity, and (3) the usage paths, or flow of users through the
locality. Each of these networks or graphs is represented by
matrices in the spreading activation algorithm. That is, each row
corresponds to a network node representing a Web page, and
similarly each column corresponds to a network node representing a
Web page. If we index the 1, 2, ..., N Web pages, there would be
i=1, 2, ..., N columns and j=1, 2, ..., N rows for each matrix
representing a graph network.

Each entry in the i-th column and j-th row of a matrix represents
the strength of connection between page i and page j (or similarly,
the amount of potential activation flow or capacity). The meaning
of these entries varies depending on the type of network through
which activation is being spread.
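As an illustration of this matrix layout, the sketch below builds an N x N matrix from a list of arcs, placing the strength from page i to page j in column i, row j. The function name `build_matrix` and the dictionary form of the arc list are assumptions.

```python
def build_matrix(pages, arcs):
    """Return an N x N matrix (a list of rows) for a graph network.

    pages: ordered list of page ids
    arcs:  {(from_page, to_page): strength}
    matrix[j][i] (row j, column i) holds the strength from page i
    to page j; absent arcs are 0.
    """
    index = {page: k for k, page in enumerate(pages)}
    n = len(pages)
    matrix = [[0] * n for _ in range(n)]
    for (src, dst), strength in arcs.items():
        matrix[index[dst]][index[src]] = strength
    return matrix
```

The same helper serves all three networks: strengths of 1 for hypertext links, traversal counts for usage paths, and real-valued similarities for text.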
Referring to FIG. 8, [...] vectors 801-807. The arcs in the graph
indicate links between the various pages. Referring now to FIG. 9,
for the matrix representation in topology networks, an entry of 0
in column i, row j, indicates no hypertext link between page i and
page j, whereas an entry of 1 indicates a hypertext link between
page i and page j. [...] represented by the entry of 1 in the
corresponding positions of the topology matrix.

FIGS. 10-11 illustrate a text similarity network and corresponding
matrix representation. Referring to FIG. 10, the width of the lines
connecting [...]. Referring now to FIG. 11, for the matrix
representation of text similarity networks, an entry of a real
number, s>=0, in column i, row j indicates the inter-document
similarity of page i to page j.

FIGS. 12-13 illustrate a usage path network and corresponding
matrix representation. Referring to FIG. 12, it should be noted
that there will only be usage between nodes [...]. Referring now to
FIG. 13, for the matrix representation of usage path networks, an
entry of an integer strength, s>=0, in column i, row j, indicates
the number of users that traversed from page i to page j.

An activation network can be represented as a graph defined by
matrix R, where each off-diagonal element R_{ij} contains the
strength of association between nodes i and j, and the diagonal
contains zeros. The strengths determine how much activation flows
from node to node. The set of source nodes of activation being
pumped into the network is represented by a vector C, where C_i
represents the activation pumped in by node i. The dynamics of
activation can be modeled over discrete steps t=1, 2, ..., N, with
activation at step t represented by a vector A(t), with element
A(t, i) representing the activation at node i at step t. The time
evolution of the flow of activation is determined by

\[ A(t) = C + M\,A(t-1) \tag{2} \]

where M is a matrix that determines the flow and decay of
activation among nodes. It is specified by

\[ M = (1 - \gamma)\,I + \alpha R \tag{3} \]

where \gamma < 1 is a parameter determining the relaxation of node
activity to zero when it receives no additional activation input,
and \alpha is a parameter denoting the amount of activation spread
from a node to its neighbors. I is the identity matrix.

Example 1

Predicting the Interests of Home Page Visitors

To illustrate, consider the situation in which it is desirable to
identify the most frequently visited organization home page using
the categorization information, and construct a Web aggregate that
contains the pages most visited from that page. The most popular
organization page can be identified by first finding the pages in
that category using the classification criteria described in FIG. 6
(i.e. the "Organization Home Page" criteria). The most popular page
would then be the identified page having the highest "frequency"
value in their corresponding document vector. To find the most
visited pages from that page, the corresponding element of C is
given a positive value and the matrix R is set to the usage path
matrix. Equation 2 above is iterated for N time steps (e.g. N=10
has provided acceptable results). The most visited pages are those
[...] activation. [...] exceed some predetermined threshold [...].
In any event a Web aggregate has been identified.

Based on this information traversal patterns can be determined
[...] type of information [...]. [...] an external user entering a
company's home page may be looking at the company's products or
financial [...] potential customer [...].

Example 2

Assessing the Typical Web Author at a Locality

Consider another situation in which the Web pages of interest are
[...]. [...] one might be interested in understanding something
about what a typical person publishing in a Web locality says about
themselves. In this case, the most typical [...] identified using
the "Personal Home Page" criteria described in FIG. 6, the
corresponding C element set to positive activation input (zeros
elsewhere), and R is set to the text similarity matrix. Iteration
of this spread of activation for N=10 time steps selects a
collection of Web pages. By reading the group project overviews,
the home pages of related people, personal interest pages, and
formal and informal groups to which the person belongs, one may get
some sense of what people are like in the organization.

Combining Activation Nets

Because of the simple properties of the activation networks, it is
easy to combine the spread of activation through any weighted
combination of activation pumped from different sources and through
different kinds of arcs. That is, simultaneously through the
topology, usage, and text similarity [...] of predicted relevancy.
For instance, one might be interested in identifying the pages most
similar in content to the pages most popularly traversed.

Visualization

Most current Web browsers provide very little support for helping
people gain an overall assessment of the structure and content of
large collections of Web pages. Information Visualization could be
used to provide an interactive overview of web localities that
facilitates navigation and general assessment. Visualizations have
been developed that provide new interactive mechanisms for making
sense of information sets with thousands of objects. The general
approach is to map properties and relations of large [...] onto
visual, interactive structures.

To the extent that the properties are ones that help users navigate
around the space and remember locations, or ones that support the
unit tasks of the user's work, the visualizations provide value to
the user. Visualizations can be applied to the Web by treating the
pages of the Web as objects with properties. Each of these
visualizations provides an overview of a web locality in terms of
some simple property of the pages. For example, the present
invention may be used in support of information visualization
techniques, such as the WebBook described in co-pending and
commonly assigned
application Ser. No. 08/525,936 entitled "Display System For
Displaying Lists of Linked Documents", to form and present larger
aggregates of related Web pages. Other examples include a Cone
Tree, which shows the connectivity structure between pages, and a
Perspective Wall, which shows time-indexed accesses of the pages.
The Cone Tree is described in U.S. Pat. No. 5,295,243 entitled
"Display of Hierarchical Three-Dimensional Structures With Rotating
Substructures". The Perspective Wall is described in U.S. Pat. No.
5,339,390 entitled "Operating A Processor To Display Stretched
Continuation Of A Workspace". Thus, these visualizations are based
on one or a few characteristics of the pages.

Overview of a Computer Controlled Display 
System in the Currently Preferred Embodiment of 
the Present Invention 

The computer based system on which the currently preferred
embodiment of the present invention may be implemented is described
with reference to FIG. 14. The computer based system and associated
operating instructions (e.g. software) embody circuitry used to
implement the present invention. Referring to FIG. 14, the computer
based system is comprised of a plurality of components coupled via
a bus 1401. The bus 1401 may consist of a plurality of parallel
buses (e.g. address, data and status buses) as well as a hierarchy
of buses (e.g. a processor bus, a local bus and an I/O bus). In any
event, the computer system is further comprised of a processor 1402
for executing instructions provided via bus 1401 from internal
memory 1403 (note that the internal memory 1403 is typically a
combination of Random Access and Read Only Memories). The processor
1402 will be used to perform various operations in support of
extracting raw data from web localities, converting the raw data
into the desired feature vectors and topology, usage path and text
similarity matrices, categorization and spreading activation.
Instructions for performing such operations are retrieved from
internal memory 1403. Such operations that would be performed by
the processor 1402 would include the processing steps described in
FIGS. 1-4 and 7. The operations would typically be provided in the
form of coded instructions in a suitable programming language using
well-known programming techniques. The processor 1402 and internal
memory 1403 may be discrete components or a single integrated
device such as an Application Specific Integrated Circuit (ASIC)
chip.

Also coupled to the bus 1401 are a keyboard 1404 for entering
alphanumeric input, external storage 1405 for storing data, a
cursor control device 1406 for manipulating a cursor, a display
1407 for displaying visual output (e.g. the WebBook) and a network
connection 1408. The keyboard 1404 would typically be a standard
QWERTY keyboard but may also be a telephone-like keypad. The
external storage 1405 may be a fixed or removable magnetic or
optical disk drive. The cursor control device 1406, e.g. a mouse or
trackball, will typically have a button or switch associated with
it to which the performance of certain functions can be programmed.
The network connection 1408 provides means for attaching to a
network, e.g. a Local Area Network (LAN) card or modem card with
appropriate software. The network ultimately attached to is the
Internet, but it may be through intermediary networks or On-Line
services such as America On-Line, Prodigy™ or CompuServe™.

Thus, a system for analyzing a collection of hyper-linked
pages is disclosed. While the present invention is described
with respect to a preferred embodiment, it would be apparent
to one skilled in the art to practice the present invention with
other configurations of digital document management systems.
Such alternate embodiments would not cause departure from
the spirit and scope of the present invention. For example,
the present invention may be implemented as software
instructions residing on a suitable memory medium for use in
operating a computer based system.

What is claimed is:

1. A system for categorizing documents contained in a
linked collection of documents comprising:

means for obtaining raw data from said linked collection
of documents, said raw data including meta information
for documents in said linked collection of documents;
means for creating a feature vector for documents in said
linked collection of documents from said raw data, said
feature vector comprising a plurality of elements;
means for defining classification criteria indicating
particular categories of document types, said classification
criteria comprising user defined weightings of the elements
for said feature vector and a corresponding class
threshold value;
processing means for applying said classification criteria
to feature vectors to determine if a document is in a
corresponding category.

2. The system as recited in claim 1 wherein said means for
obtaining raw data for said linked collection of documents is 
further comprised of a first agent for traversing said linked 
collection of documents to obtain topology information and 
document meta information. 

3. The system as recited in claim 2 wherein the plurality
of elements of a feature vector for a document in said linked 
collection of documents include: 

size information for said document; 

inlink information for said document, said inlink information
indicating the number of links in said linked
collection of documents that point to said document;

outlink information for said document, said outlink information
indicating the number of links the document
contains to other documents in said linked collection of
documents;

frequency information for said document, said frequency
information indicating the number of times said document
was requested during a sample period;

source information for said document, said source information
indicating the number of times said document
was identified as the start of a path traversal;

text similarity information for said document, said text
similarity information indicating the similarity of the
text of the document to documents in said linked
collection of documents to which they are linked; and

depth information for said document, said depth information
indicating the average depth in said linked collection
of documents of documents to which said document
links.

4. The system as recited in claim 3 wherein said process- 
ing is comprised of means for determining that a document 
is a class if after applying said classification criteria the 
result exceeds said corresponding class threshold value.

5. The system as recited in claim 1 wherein said linked
collection of documents is a Web locality. 

6. A method for generating a list of web pages in a web 
locality that are contained in a user defined class comprising 
the steps of: 

a) obtaining raw data for said web locality, said raw data
including topology information and web locality usage
information;

b) generating page meta data for each web page in said
web locality from said raw data;

c) generating feature vectors for each web page in said
web locality using said page meta data and said topology
information, said feature vector comprised of a
plurality of elements;

d) obtaining a classification criteria for determining if a 
web page is a member of a category of web pages, said 
classification criteria comprising user defined weight- 
ings of the plurality of elements for said feature vector 
and a corresponding class threshold value; and 

e) applying said classification criteria to said feature
vectors to obtain a list of pages in said category.

7. The method as recited in claim 6 wherein said step of 
obtaining topology information for said web locality is 
comprised of the steps of: 

a1) retrieving a web page;

a2) storing location information for said web page; 
a3) parsing said web page to identify links to other web 
pages; and 

a4) repeating steps a1)-a3) for each of said other web
pages. 

8. The method as recited in claim 6 wherein said step of
obtaining page meta data for each web page in said web
locality is further comprised of the step of collecting page
meta data for a page as the page is retrieved.

9. The method as recited in claim 6 wherein said step of
generating feature vectors for each web page in said web
locality using said page meta data and said topology information
is further comprised of the step of for each associated
web page in said web locality performing the steps of:
extracting size information for said associated web page 
and storing as a size element in said corresponding 
feature vector; 
extracting inlink information for said associated web
page, said inlink information indicating the number of
links in said web locality that point to said associated
web page and storing as an inlink element in said
corresponding feature vector;
extracting outlink information for said associated web
page, said outlink information indicating the number of
links the web page contains to other web pages in said
web locality and storing as an outlink element in said
corresponding feature vector;
extracting frequency information for said associated web
page, said frequency information indicating the number
of times said associated web page was requested during
a sample period and storing as a frequency element in
said corresponding feature vector;
extracting source information for said associated web
page, said source information indicating the number of
times said associated web page was identified as the
start of a path traversal and storing as a source element
in said corresponding feature vector;
extracting text similarity information for said associated
web page, said text similarity information indicating
the similarity of the text of the associated web page to
other web pages in said web locality to which they are
linked and storing as a text similarity element in said
corresponding feature vector; and
extracting depth information for said associated web
page, said depth information indicating the average
depth in said web locality of other web pages to which
said associated web page links and storing as a depth
element in said corresponding feature vector.

10. The method as recited in claim 9 wherein said step of
applying said classification criteria to said feature vectors to
obtain a list of pages in said category is further comprised of
the steps of:

for each element of a feature vector applying a corresponding
weighting value to obtain a feature value;
summing the resulting feature values; and
comparing said sum to said class threshold value to
determine if said corresponding page is in said class.
11. A system for generating characteristic data for a linked 
collection of documents comprising: 

means for obtaining raw data for said linked collection of 
documents, said raw data including usage data, topol- 
ogy data and content data; 
means for creating a feature vector for each document in 
said linked collection of documents from said raw data; 
and 

means for categorizing each of said documents in said
linked collection of documents according to predetermined
classification criteria, said predetermined classification
criteria comprising user defined weightings
of the elements for said feature vector and a corresponding
class threshold value.
12. The system as recited in claim 11 further comprising:
means for creating usage, topology and text similarity 
maps for said linked collection of documents from said 
raw data; 

means for predicting a relevant set of documents for a 
subset of said linked collection of documents using one
or more of said usage, topology and text similarity 
maps. 

13. A system for categorizing documents contained in a
linked collection of documents comprising: 
means for obtaining raw data from said linked collection 
of documents, said raw data including meta informa- 
tion for documents in said linked collection of docu- 
ments; 

means for creating a feature vector for documents in said 
linked collection of documents from said raw data, said 
feature vector having at least one element indicating a 
frequency of request for an associated document; 
means for defining classification criteria indicating par- 
ticular categories of document types; 
processing means for applying said classification criteria 
to feature vectors to determine if a document is in a 
corresponding category. 
14. A method for generating a list of web pages in a web 
locality that are contained in a user defined class comprising 
the steps of: 

a) obtaining raw data for said web locality, said raw data 
including topology information and web locality usage 
information; 

b) generating page meta data for each web page in said 
web locality from said raw data, said meta data includ- 
ing data indicating a frequency of request for an 
associated document; 

c) generating feature vectors for each web page in said 
web locality using said page meta data and said topol- 
ogy information; 

d) obtaining a classification criteria for determining if a 
web page is a member of a category of web pages; and 

e) applying said classification criteria to said feature 
vectors to obtain a list of pages in said category. 
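
The weighting and threshold test recited in claims 4, 6 and 10 may be sketched as follows. This is a minimal Python illustration under stated assumptions, not the disclosed implementation: the class `FeatureVector`, the functions `weighted_sum` and `pages_in_class`, and any numeric values supplied to them are hypothetical, and the element values are used unscaled, which the claims do not specify.

```python
from dataclasses import dataclass, astuple


@dataclass
class FeatureVector:
    """One value per element named in claims 3 and 9 (field names assumed)."""
    size: float
    inlinks: float
    outlinks: float
    frequency: float
    source: float
    text_similarity: float
    depth: float


def weighted_sum(vector, weights):
    """Apply a weighting to each element and sum the resulting feature values.

    `weights` reuses FeatureVector so that each user defined weighting
    lines up with the element it applies to.
    """
    return sum(w * x for w, x in zip(astuple(weights), astuple(vector)))


def pages_in_class(vectors, weights, threshold):
    """Return the pages whose weighted sum exceeds the class threshold.

    `vectors` maps a page location to its FeatureVector.
    """
    return [
        page
        for page, vector in vectors.items()
        if weighted_sum(vector, weights) > threshold
    ]
```

A class that favors heavily linked pages, for instance, could be expressed by giving nonzero weightings only to the inlink and outlink elements and choosing a threshold on their sum; a different class would use a different weighting vector and threshold over the same feature vectors.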



